HR Analytics Employee Attrition & Performance

BCon 147: Special Topics

Author

Jan Yloisa Sarong

Published

October 25, 2024

1 Project Overview

In this project, we will explore employee attrition and performance using the HR Analytics Employee Attrition & Performance dataset. The primary goal is to develop insights into the factors that contribute to employee attrition. By analyzing a range of factors, including demographic data, job satisfaction, work-life balance, and job role, we aim to help businesses identify key areas where they can improve employee retention.

2 Scenario

Imagine you are working as a data analyst for a mid-sized company that is experiencing high employee turnover, especially among high-performing employees. The company has been facing increased costs related to hiring and training new employees, and management is concerned about the negative impact on productivity and morale. The human resources (HR) team has collected historical employee data and now looks to you for actionable insights. They want to understand why employees are leaving and how to retain talent effectively.

Your task is to analyze the dataset and provide insights that will help HR prioritize retention strategies. These strategies could include interventions like revising compensation policies, improving job satisfaction, or focusing on work-life balance initiatives. The success of your analysis could lead to significant cost savings for the company and an increase in employee engagement and performance.

3 Understanding data source

The dataset used for this project provides information about employee demographics, performance metrics, and various satisfaction ratings. The dataset is particularly useful for exploring how factors such as job satisfaction, work-life balance, and training opportunities influence employee performance and attrition.

This dataset is well-suited for conducting in-depth analysis of employee performance and retention, enabling us to build predictive models that identify the key drivers of employee attrition. Additionally, we can assess the impact of various organizational factors, such as training and work-life balance, on both performance and retention outcomes.

## the datatable function from the DT package creates an HTML widget that displays the dataset
## install the DT package if it is not yet available in your R environment
readxl::read_excel("dataset/dataset-variable-description.xlsx") |> 
  DT::datatable()

4 Data wrangling and management

Libraries

Task: Load the necessary libraries

Before working with the dataset, load the libraries needed for data wrangling, analysis, and visualization. Packages that are not yet installed can be installed with the install.packages function. Some packages are introduced later in this project, so install them as needed and load them here.

options(repos = c(CRAN = "https://cloud.r-project.org"))

# load all your libraries here
library(tidyverse)
library(DT)
library(janitor)
library(readxl)
library(reshape2)
library(lubridate)
library(broom)
library(scales)
library(forcats)
library(dplyr)
library(GGally)
library(colorspace)
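
Because some packages are only installed later in the project, it can help to check which ones are missing before calling library(). The sketch below is one possible way to do that. The name find_missing_packages is a hypothetical helper (it does not come from any package), and the package list simply mirrors some of the libraries loaded above.

```r
# Hypothetical helper: return the names of packages that are not installed.
find_missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Package names taken from the library() calls in this project.
required_pkgs <- c("tidyverse", "DT", "janitor", "readxl", "reshape2",
                   "broom", "scales", "GGally", "colorspace")
missing_pkgs <- find_missing_packages(required_pkgs)

# Install only in an interactive session so that rendering never triggers downloads.
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

Guarding the installation with interactive() is a design choice: it keeps a rendered report from downloading packages each time it is built.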

4.1 Data importation

Task 4.1. Merging dataset
  • Import the two datasets, Employee.csv and PerformanceRating.csv. Save Employee.csv as employee_dta and PerformanceRating.csv as perf_rating_dta.

  • Merge the two datasets using the left_join function from dplyr. Use the EmployeeID variable as the variable to join by. You may read more information about the left_join function here.

  • Save the merged dataset as hr_perf_dta and display the dataset using the datatable function from DT package.

## import the two datasets here (paths are relative to the project root)
employee_dta <- read.csv("dataset/Employee.csv")
perf_rating_dta <- read.csv("dataset/PerformanceRating.csv")

## merge employee_dta and perf_rating_dta using the left_join function
## and save the merged dataset as hr_perf_dta
hr_perf_dta <- left_join(employee_dta, perf_rating_dta, by = "EmployeeID")

## Use the datatable from DT package to display the merged dataset
datatable(hr_perf_dta)
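
To see what left_join does to the row count, the sketch below joins two small made-up tables (not the project data). An employee with several performance reviews appears once per review, and an employee with no review is kept with NA in the rating column. This may be why the merged dataset has more rows than there are employees and why some rating columns contain missing values, although that should be checked against the actual files.

```r
library(dplyr)

# Hypothetical toy tables, for illustration only.
toy_employees <- tibble(
  EmployeeID = c("E1", "E2", "E3"),
  Department = c("Sales", "Technology", "Sales")
)
toy_reviews <- tibble(
  EmployeeID    = c("E1", "E1", "E2"),
  ManagerRating = c(3, 4, 5)
)

# Every row of toy_employees is kept; matching reviews are attached.
toy_joined <- left_join(toy_employees, toy_reviews, by = "EmployeeID")
print(toy_joined)

# Count how many rows each employee contributes after the join.
rows_per_employee <- count(toy_joined, EmployeeID)
print(rows_per_employee)
```

Here E1 contributes two rows, E2 one row, and E3 one row with a missing ManagerRating.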

4.2 Data management

Task 4.2. Standardizing variable names
  • Using the clean_names function from janitor package, standardize the variable names by using the recommended naming of variables.

  • Save the renamed variables as hr_perf_dta to update the dataset.

## clean names using the janitor packages and save as hr_perf_dta
hr_perf_dta <- hr_perf_dta |> 
  clean_names()

## display the renamed hr_perf_dta using datatable function
datatable(hr_perf_dta)
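
As a quick illustration of what clean_names changes, the sketch below applies it to a one-row toy data frame whose column names imitate the style of the raw CSV headers. The toy columns are assumptions for illustration and are not read from the project files.

```r
library(janitor)

# Toy data frame with CamelCase names (illustrative only).
toy_dta <- data.frame(EmployeeID = "E1", JobSatisfaction = 4, YearsAtCompany = 2)

# clean_names converts the column names to snake_case.
toy_clean <- clean_names(toy_dta)
print(names(toy_clean))
```

After this step the later code refers to columns by their snake_case names, such as job_satisfaction and years_at_company.
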
Task 4.2. Recode data entries
  • Create a new variable cat_education wherein education is 1 = No formal education; 2 = High school; 3 = Bachelor; 4 = Masters; 5 = Doctorate. Use the case_when function to accomplish this task.

  • Similarly, create new variables cat_envi_sat, cat_job_sat, and cat_relation_sat for environment_satisfaction, job_satisfaction, and relationship_satisfaction, respectively. Re-code the values accordingly as 1 = Very dissatisfied; 2 = Dissatisfied; 3 = Neutral; 4 = Satisfied; and 5 = Very satisfied.

  • Create new variables cat_work_life_balance, cat_self_rating, cat_manager_rating for work_life_balance, self_rating, and manager_rating, respectively. Re-code accordingly as 1 = Unacceptable; 2 = Needs improvement; 3 = Meets expectation; 4 = Exceeds expectation; and 5 = Above and beyond.

  • Create a new variable bi_attrition by transforming attrition variable as a numeric variable. Re-code accordingly as No = 0, and Yes = 1.

  • Save all the changes in the hr_perf_dta. Note that saving the changes with the same name will update the dataset with the new variables created.

## create cat_education
## create cat_envi_sat,  cat_job_sat, and cat_relation_sat
## create cat_work_life_balance, cat_self_rating, and cat_manager_rating
## create bi_attrition

## map the numeric codes 1-5 to their text labels; any other value becomes NA
convert_to_cat <- function(variable, categories) {
  case_when(
    variable == 1 ~ categories[1],
    variable == 2 ~ categories[2],
    variable == 3 ~ categories[3],
    variable == 4 ~ categories[4],
    variable == 5 ~ categories[5],
    TRUE ~ NA_character_
  )
}

hr_perf_dta <- hr_perf_dta |> 
  mutate(
    cat_education = convert_to_cat(education, 
                                   c("No formal education", "High school", "Bachelor", "Masters", "Doctorate")),
    
    cat_envi_sat = convert_to_cat(environment_satisfaction, 
                                  c("Very dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very satisfied")),
    
    cat_job_sat = convert_to_cat(job_satisfaction, 
                                 c("Very dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very satisfied")),
    
    cat_relation_sat = convert_to_cat(relationship_satisfaction, 
                                      c("Very dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very satisfied")),
    
    cat_work_life_balance = convert_to_cat(work_life_balance, 
                                           c("Unacceptable", "Needs improvement", "Meets expectation", "Exceeds expectation", "Above and beyond")),
    
    cat_self_rating = convert_to_cat(self_rating, 
                                     c("Unacceptable", "Needs improvement", "Meets expectation", "Exceeds expectation", "Above and beyond")),
    
    cat_manager_rating = convert_to_cat(manager_rating, 
                                        c("Unacceptable", "Needs improvement", "Meets expectation", "Exceeds expectation", "Above and beyond")),
    
    bi_attrition = ifelse(attrition == "Yes", 1, 0)
  )

# Print the updated dataset using datatable
datatable(hr_perf_dta)
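
The task asks for case_when, but the same recoding can also be sketched with an ordered factor, which keeps the labels in their natural order when they are plotted or tabulated. The example below uses a small made-up vector of codes and is an alternative for comparison, not the method required by the task.

```r
library(dplyr)

# Satisfaction labels as listed in the task description.
sat_labels <- c("Very dissatisfied", "Dissatisfied", "Neutral",
                "Satisfied", "Very satisfied")

# Toy ratings, including a missing value (illustrative only).
toy_ratings <- tibble(job_satisfaction = c(1L, 3L, 5L, NA))

# factor() maps codes 1-5 to labels and keeps them ordered.
toy_recoded <- toy_ratings |>
  mutate(cat_job_sat = factor(job_satisfaction,
                              levels = 1:5,
                              labels = sat_labels,
                              ordered = TRUE))
print(toy_recoded)
```

Missing codes stay missing, and levels(toy_recoded$cat_job_sat) returns the five labels in order from "Very dissatisfied" to "Very satisfied".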

5 Exploratory data analysis

5.1 Descriptive statistics of employee attrition

Task 5.1. Breakdown of attrition by key variables
  • Select the variables attrition, job_role, department, age, salary, job_satisfaction, and work_life_balance. Save as attrition_key_var_dta.

  • Compute and plot the attrition rate across job_role, department, age, salary, job_satisfaction, and work_life_balance. To compute the attrition rate, group the dataset by job role. Afterward, you can use the count function to get the frequency of attrition for each job role and then divide it by the total number of observations. Save the computation as pct_attrition. Do not forget to ungroup before storing the output. Store the output as attrition_rate_job_role.

  • Plot for the attrition rate across job_role has been done for you! Study each line of code. You have the freedom to customize your plot accordingly. Show your creativity!
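
One common way to define an attrition rate is the share of records within each group where attrition is "Yes". The sketch below computes that on a small made-up table. It is offered as a point of comparison under that assumed definition, and the toy values are not taken from the project data.

```r
library(dplyr)

# Toy records (illustrative only): 4 in Sales, 2 in Technology.
toy_attrition <- tibble(
  department = c("Sales", "Sales", "Sales", "Sales", "Technology", "Technology"),
  attrition  = c("Yes", "No", "No", "No", "Yes", "Yes")
)

# Within each department: records, leavers, and leavers as a percentage.
toy_rate <- toy_attrition |>
  group_by(department) |>
  summarise(
    n_records     = n(),
    n_left        = sum(attrition == "Yes"),
    pct_attrition = n_left / n_records * 100,
    .groups = "drop"
  )
print(toy_rate)
```

With this definition a small department can have a high attrition rate even though it contributes few rows to the dataset. If the merged data holds several rows per employee, the rate is per record unless duplicates are removed first.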

## selecting attrition key variables and save as `attrition_key_var_dta`
attrition_key_var_dta <- hr_perf_dta |> 
  select(attrition, job_role, department, age, salary, job_satisfaction, work_life_balance)

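## compute pct_attrition across variables and save as attrition_rate_<variable>
## note: count() here tallies every row in a group, whether or not the employee left,
## so pct_attrition is each group's share of all records rather than the share of that group who left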
attrition_rate_job_role <- attrition_key_var_dta  |> 
  group_by(job_role)  |> 
  count()  |> 
  ungroup() |> 
  mutate(pct_attrition = n / sum(n) * 100) 

attrition_rate_department <- attrition_key_var_dta  |> 
  group_by(department)  |> 
  count()  |> 
  ungroup() |> 
  mutate(pct_attrition = n / sum(n) * 100)

attrition_rate_age <- attrition_key_var_dta  |> 
  group_by(age)  |> 
  count()  |> 
  ungroup() |> 
  mutate(pct_attrition = n / sum(n) * 100)

attrition_rate_salary <- attrition_key_var_dta  |>
  group_by(salary)  |> 
  count()  |> 
  ungroup() |> 
  mutate(pct_attrition = n / sum(n) * 100)

attrition_rate_job_satisfaction <- attrition_key_var_dta  |> 
  group_by(job_satisfaction)  |> 
  count()  |> 
  ungroup() |> 
  mutate(pct_attrition = n / sum(n) * 100)

attrition_rate_work_life_balance <- attrition_key_var_dta  |> 
  group_by(work_life_balance)  |> 
  count()  |> 
  ungroup() |> 
  mutate(pct_attrition = n / sum(n) * 100)

## print attrition_rate
print(attrition_rate_job_role)
# A tibble: 13 × 3
   job_role                      n pct_attrition
   <chr>                     <int>         <dbl>
 1 Analytics Manager           213         3.09 
 2 Data Scientist             1387        20.1  
 3 Engineering Manager         307         4.45 
 4 HR Business Partner          25         0.362
 5 HR Executive                119         1.72 
 6 HR Manager                   17         0.246
 7 Machine Learning Engineer   582         8.44 
 8 Manager                     145         2.10 
 9 Recruiter                   152         2.20 
10 Sales Executive            1567        22.7  
11 Sales Representative        500         7.25 
12 Senior Software Engineer    512         7.42 
13 Software Engineer          1373        19.9  
print(attrition_rate_department)
# A tibble: 3 × 3
  department          n pct_attrition
  <chr>           <int>         <dbl>
1 Human Resources   313          4.54
2 Sales            2211         32.0 
3 Technology       4375         63.4 
print(attrition_rate_age)
# A tibble: 34 × 3
     age     n pct_attrition
   <int> <int>         <dbl>
 1    18    58         0.841
 2    19   119         1.72 
 3    20   149         2.16 
 4    21   254         3.68 
 5    22   324         4.70 
 6    23   264         3.83 
 7    24   472         6.84 
 8    25   597         8.65 
 9    26   545         7.90 
10    27   412         5.97 
# ℹ 24 more rows
print(attrition_rate_salary)
# A tibble: 1,455 × 3
   salary     n pct_attrition
    <int> <int>         <dbl>
 1  20387    10        0.145 
 2  20418     1        0.0145
 3  20526     1        0.0145
 4  20583     1        0.0145
 5  20650    10        0.145 
 6  20778     1        0.0145
 7  20802     1        0.0145
 8  21026     1        0.0145
 9  21158     1        0.0145
10  21202     1        0.0145
# ℹ 1,445 more rows
print(attrition_rate_job_satisfaction)
# A tibble: 6 × 3
  job_satisfaction     n pct_attrition
             <int> <int>         <dbl>
1                1   130          1.88
2                2  1674         24.3 
3                3  1651         23.9 
4                4  1685         24.4 
5                5  1569         22.7 
6               NA   190          2.75
print(attrition_rate_work_life_balance)
# A tibble: 6 × 3
  work_life_balance     n pct_attrition
              <int> <int>         <dbl>
1                 1   121          1.75
2                 2  1702         24.7 
3                 3  1670         24.2 
4                 4  1706         24.7 
5                 5  1510         21.9 
6                NA   190          2.75
## Plot the attrition rate
num_roles <- nrow(attrition_rate_job_role)
colors <- heat_hcl(num_roles) 

# Plot for job role
attrition_rate_job_role <- attrition_rate_job_role |> 
  filter(!is.na(job_role) & !is.na(pct_attrition))
ggplot(attrition_rate_job_role, 
       aes(x = reorder(job_role, -pct_attrition), 
           y = pct_attrition)) +
  geom_bar(stat = "identity", 
           fill = "#40E0D0") + 
  labs(title = "Attrition Rate by Job Role",
       x = "Job Role",
       y = "Attrition Rate (%)") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(face = "bold"),
    panel.grid.minor = element_blank()
  )

# Plot for department
attrition_rate_department <- attrition_rate_department |> 
  filter(!is.na(department) & !is.na(pct_attrition))  
ggplot(attrition_rate_department, aes(x = reorder(department, -pct_attrition), y = pct_attrition, fill = department)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = colors) +
  labs(title = "Attrition Rate by Department",
       x = "Department",
       y = "Attrition Rate (%)") +
  theme_minimal() +
  theme(axis.text.x = element_blank())

# Plot for age group 
attrition_rate_age |> 
  ggplot(aes(x = age, y = pct_attrition, fill = pct_attrition)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +  
  scale_fill_gradient(low = "#40E0D0", high = "#FF8080") +  
  labs(title = "Attrition Rate by Age Group", 
       x = "Age Group", 
       y = "Attrition Rate (%)") +
  theme_minimal(base_size = 14) +  
  theme(
    legend.position = "none",
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10),
    panel.grid.major.y = element_blank(),  
    panel.grid.minor = element_blank()  
  )

# Plot for salary group 
attrition_rate_salary <- attrition_rate_salary |> 
  filter(!is.na(salary) & !is.na(pct_attrition)) |> 
  mutate(salary_group = cut(salary, 
                           breaks = c(20000, 100000, 200000, 300000, 400000, 500000, 547000), 
                           labels = c('20k-100k', '100k-200k', '200k-300k', '300k-400k', '400k-500k', '500k-547k')))
ggplot(attrition_rate_salary, 
       aes(x = reorder(salary_group, -pct_attrition), 
           y = pct_attrition)) +  
  geom_bar(stat = "identity", 
           position = "dodge", 
           width = 0.7,
           fill = "#40E0D0") +  
  labs(title = "Attrition Rate by Salary Group",
       x = "Salary Group", 
       y = "Attrition Rate (%)") +
  theme_minimal(base_size = 14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),  
        plot.title = element_text(face = "bold"),
        panel.grid.major.x = element_blank())

# Plot for job satisfaction
ggplot(attrition_rate_job_satisfaction, 
       aes(x = job_satisfaction, y = pct_attrition, fill = job_satisfaction)) +
  geom_bar(stat = "identity", fill = "#40E0D0") + 
  labs(title = "Attrition Rate by Job Satisfaction Level",
       x = "Job Satisfaction Level",
       y = "Attrition Rate (%)") +
  theme_minimal(base_size = 14) +
  theme(
    axis.text.x = element_text(angle = 0),  
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
    legend.position = "none"  
  )

# Plot for work-life balance
ggplot(attrition_rate_work_life_balance, 
       aes(x = work_life_balance, y = pct_attrition)) +
  geom_bar(stat = "identity", fill = "#40E0D0") +  
  labs(title = "Attrition Rate by Work-Life Balance Level",
       x = "Work-Life Balance Level",
       y = "Attrition Rate (%)") +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
    legend.position = "none"  
  )
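
The salary plot relies on cut() to form salary groups. By default cut() builds intervals that are open on the left, so a value exactly equal to the lowest break is not assigned to any group. The sketch below shows this on a few made-up salaries with a shortened set of breaks. Whether it matters for the project data depends on the actual minimum and maximum salary.

```r
# Shortened breaks and labels in the style of the salary plot (illustrative only).
salary_breaks <- c(20000, 100000, 200000)
salary_labels <- c("20k-100k", "100k-200k")
toy_salary <- c(20000, 20387, 100000, 100001)

# Default: intervals are (20000, 100000] and (100000, 200000].
default_bins <- cut(toy_salary, breaks = salary_breaks, labels = salary_labels)

# include.lowest = TRUE closes the first interval on the left as well.
closed_bins <- cut(toy_salary, breaks = salary_breaks, labels = salary_labels,
                   include.lowest = TRUE)

print(data.frame(toy_salary, default_bins, closed_bins))
```

Adding include.lowest = TRUE is a low-cost safeguard when the lowest break is meant to be part of the first group.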

5.2 Identifying attrition key drivers using correlation analysis

Task 5.2. Conduct a correlation analysis to identify key drivers
  • Conduct a correlation analysis of key variables: bi_attrition, salary, years_at_company, job_satisfaction, manager_rating, and work_life_balance. Use the cor() function to run the correlation analysis. Remove missing values using the na.omit() before running the correlation analysis. Save the output in hr_corr.

  • Use a correlation matrix or heatmap to visualize the relationship between these variables and attrition. You can use the GGally package and use the ggcorr function to visualize the correlation heatmap. You may explore this site for more information: ggcorr.

  • Discuss which factors seem most correlated with attrition and what that suggests about why employees are leaving.

## conduct correlation of key variables. 
key_variables_clean <- hr_perf_dta |> 
  select(bi_attrition, salary, years_at_company, job_satisfaction, manager_rating, work_life_balance) |> 
  mutate(across(everything(), as.numeric)) |> 
  na.omit()
hr_corr <- cor(key_variables_clean, use = "complete.obs")

## Print hr_corr
datatable(hr_corr)
## use the ggcorr function from the GGally package (loaded earlier) to visualize the correlation
## run install.packages("GGally") once beforehand if the package is not yet installed

key_variables <- hr_perf_dta |> 
  select(bi_attrition, salary, years_at_company, job_satisfaction, manager_rating, work_life_balance)

key_variables_clean <- na.omit(key_variables)

# Visualize the correlation matrix using ggcorr
ggcorr(key_variables_clean, 
        label = TRUE,         
        label_alpha = TRUE,    
        label_size = 4,        
        low = "#6D9EC1",       
        mid = "white",         
        high = "#E46726") +
  ggtitle("Correlation Matrix of Key Variables")

Discussion:

The correlation analysis relates attrition to four candidate drivers. The coefficients themselves are stored in hr_corr and shown in the heatmap; the points below summarize the direction of each relationship and how well the rest of this analysis supports it:

  1. Salary:

    • Correlation: A negative correlation with attrition indicates that lower salaries are associated with higher turnover. The t-test in Section 5.4 supports this: employees who left earned about 43,000 less on average than those who stayed.

    • Implication: Employees may feel undercompensated, leading them to seek better-paying opportunities elsewhere.

  2. Job Satisfaction:

    • Correlation: A negative correlation would indicate that higher job satisfaction is linked to lower attrition. Any such relationship is probably weak here, because job satisfaction is not a statistically significant predictor in the logistic regression in Section 5.3 (p = 0.276).

    • Implication: Employees who are satisfied with their roles and work environment are generally more likely to stay, but this dataset offers little evidence that job satisfaction is what separates leavers from stayers.

  3. Work-Life Balance:

    • Correlation: A negative correlation would suggest that better work-life balance is associated with lower attrition. As with job satisfaction, the relationship is probably weak, since work-life balance is not significant in the logistic regression (p = 0.419).

    • Implication: Poor work-life balance can lead to burnout and prompt employees to seek more flexible opportunities, but the data here do not single it out as a driver of attrition.

  4. Manager Rating:

    • Correlation: The correlation with attrition must be close to zero, because the average manager rating reported in Section 5.5 is almost the same for employees who stayed (3.48) and those who left (3.46).

    • Implication: Section 5.5 treats manager_rating as a performance rating given to the employee, so this result suggests that employees who leave are rated about as highly as those who stay.
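
To judge which variables are most correlated with attrition, it can help to pull the attrition row out of the correlation matrix and sort it. The sketch below does this on simulated data, where attrition is constructed to depend on tenure only. The data, the seed, and the variable effects are assumptions made for illustration and say nothing about the real dataset.

```r
set.seed(147)
n <- 200

# Simulated data: attrition depends on tenure, not on job satisfaction.
toy <- data.frame(
  years_at_company = runif(n, min = 0, max = 10),
  job_satisfaction = sample(1:5, n, replace = TRUE)
)
toy$bi_attrition <- as.numeric(toy$years_at_company + rnorm(n) < 3)

# Pearson correlation matrix, as produced by cor() in the task.
toy_corr <- cor(toy)

# Correlations of every other variable with attrition, most negative first.
attrition_corr <- sort(toy_corr["bi_attrition", colnames(toy_corr) != "bi_attrition"])
print(round(attrition_corr, 2))
```

Applied to hr_corr, the same indexing gives a ranked list that can be quoted directly in the discussion instead of reading values off the heatmap.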

5.3 Predictive modeling for attrition

Task 5.3. Predictive modeling for attrition
  • Create a logistic regression model to predict employee attrition using the following variables: salary, years_at_company, job_satisfaction, manager_rating, and work_life_balance. Save the model as hr_attrition_glm_model. Print the summary of the model using the summary function.

  • Install the sjPlot package and use the tab_model function to display the summary of the model. You may read the documentation here on how to customize your model summary.

  • Also, use the plot_model function to visualize the model coefficients. You may read the documentation here on how to customize your model visualization.

  • Discuss the results of the logistic regression model and what they suggest about the factors that contribute to employee attrition.
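
A fitted logistic regression can also be used to obtain predicted probabilities of attrition, which are often easier to communicate than coefficients. The sketch below fits a one-predictor model to simulated data and predicts for three hypothetical tenures. The simulated relationship and all object names are assumptions for illustration.

```r
set.seed(147)
n <- 500

# Simulated data: the probability of leaving falls with tenure.
toy <- data.frame(years_at_company = runif(n, min = 0, max = 10))
toy$bi_attrition <- rbinom(n, size = 1, prob = plogis(2 - 0.6 * toy$years_at_company))

# Logistic regression with a single predictor.
toy_model <- glm(bi_attrition ~ years_at_company, data = toy, family = binomial)

# Predicted probability of attrition for three hypothetical employees.
new_employees <- data.frame(years_at_company = c(1, 5, 9))
predicted_prob <- predict(toy_model, newdata = new_employees, type = "response")
print(round(predicted_prob, 3))
```

Using type = "response" returns probabilities between 0 and 1; the default type returns values on the log-odds scale.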

## run a logistic regression model to predict employee attrition
## save the model as hr_attrition_glm_model
hr_attrition_glm_model <- glm(bi_attrition ~ salary + years_at_company + job_satisfaction + manager_rating + work_life_balance,
                               data = hr_perf_dta,
                               family = binomial)

## print the summary of the model using the summary function
summary(hr_attrition_glm_model)

Call:
glm(formula = bi_attrition ~ salary + years_at_company + job_satisfaction + 
    manager_rating + work_life_balance, family = binomial, data = hr_perf_dta)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)        2.571e+00  2.173e-01  11.831   <2e-16 ***
salary            -3.633e-06  4.086e-07  -8.893   <2e-16 ***
years_at_company  -6.333e-01  1.476e-02 -42.919   <2e-16 ***
job_satisfaction   3.470e-02  3.186e-02   1.089    0.276    
manager_rating     5.071e-03  3.810e-02   0.133    0.894    
work_life_balance  2.587e-02  3.198e-02   0.809    0.419    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8574.5  on 6708  degrees of freedom
Residual deviance: 4781.6  on 6703  degrees of freedom
  (190 observations deleted due to missingness)
AIC: 4793.6

Number of Fisher Scoring iterations: 5
## install sjPlot package and use tab_model function to display the summary of the model
library(sjPlot)

tab_model(hr_attrition_glm_model, 
          title = "Logistic Regression Model for Employee Attrition",
          ## a named vector matches labels to model terms, so the intercept needs no label
          pred.labels = c(salary = "Salary", 
                          years_at_company = "Years at Company", 
                          job_satisfaction = "Job Satisfaction", 
                          manager_rating = "Manager Rating", 
                          work_life_balance = "Work-Life Balance"), 
          dv.labels = "Attrition (0 = No, 1 = Yes)",
          show.ci = FALSE,  
          show.se = TRUE)   

Logistic Regression Model for Employee Attrition

                     Attrition (0 = No, 1 = Yes)
Predictors           Odds Ratios   std. Error   p
(Intercept)          13.08         2.84         <0.001
Salary               1.00          0.00         <0.001
Years at Company     0.53          0.01         <0.001
Job Satisfaction     1.04          0.03         0.276
Manager Rating       1.01          0.04         0.894
Work-Life Balance    1.03          0.03         0.419
Observations         6709
R2 Tjur              0.502

## use plot_model function to visualize the model coefficients
plot_model(hr_attrition_glm_model, 
           type = "est",               
           show.values = TRUE,          
           show.p = TRUE,               
           ci.lvl = 0.95,              
           title = "Logistic Regression Coefficients: Employee Attrition",
           axis.title = c("Odds Ratios", "Predictors"),
           colors = "Set1",            
           value.offset = 0.3,         
           dot.size = 3,                
           line.size = 1.2,            
           axis.labels = c("Work-Life Balance", "Manager Rating", "Job Satisfaction", "Years at Company", "Salary"))

Discussion:
  1. Salary: Higher salaries are associated with a lower likelihood of attrition (coefficient = -3.633e-06, p < 0.001). The odds ratio rounds to 1.00 only because salary is measured in single currency units: for a 10,000 increase in salary, the odds of leaving fall by roughly 3.6%, holding the other predictors constant.

  2. Years at Company: Tenure is the strongest predictor in the model (odds ratio = 0.53, p < 0.001). Each additional year with the company is associated with about 47% lower odds of leaving, which suggests that attrition is concentrated among employees early in their tenure.

  3. Job Satisfaction: The coefficient is small, positive, and not statistically significant (odds ratio = 1.04, p = 0.276). Once salary and tenure are accounted for, the model provides no evidence that job satisfaction affects attrition.

  4. Manager Rating: The effect is not statistically significant (odds ratio = 1.01, p = 0.894), so manager ratings do not help predict attrition in this model.

  5. Work-Life Balance: The effect is not statistically significant (odds ratio = 1.03, p = 0.419). Work-life balance may still matter to individual employees, but the model does not identify it as a driver of attrition.
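
The odds ratios in the model table are the exponentiated log-odds coefficients from the summary output. The sketch below reproduces that conversion from the printed estimates and rescales the salary effect to a 10,000 increase. The rescaling unit is an assumption chosen for readability, and the object names are illustrative.

```r
# Log-odds estimates copied from the printed model summary.
log_odds <- c(
  "(Intercept)"     = 2.571e+00,
  salary            = -3.633e-06,
  years_at_company  = -6.333e-01,
  job_satisfaction  = 3.470e-02,
  manager_rating    = 5.071e-03,
  work_life_balance = 2.587e-02
)

# Odds ratio = exp(log-odds coefficient).
odds_ratios <- exp(log_odds)
print(round(odds_ratios, 2))

# Salary effect for a 10,000 increase instead of a 1-unit increase.
or_salary_10k <- exp(log_odds["salary"] * 10000)
print(round(or_salary_10k, 3))

# Percentage change in the odds of leaving per additional year of tenure.
pct_change_tenure <- (odds_ratios["years_at_company"] - 1) * 100
print(round(pct_change_tenure, 1))
```

Rescaling a predictor does not change the model fit; it only changes the unit in which the effect is reported.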

5.4 Analysis of compensation and turnover

Task 5.4. Analyzing compensation and turnover
  • Compare the average monthly income of employees who left the company (bi_attrition = 1) and those who stayed (bi_attrition = 0). Use the t.test function to conduct a t-test and determine if there is a significant difference in average monthly income between the two groups. Save the results in a variable called attrition_ttest_results.

  • Install the report package and use the report function to generate a report of the t-test results.

  • Install the ggstatsplot package and use the ggbetweenstats function to visualize the distribution of monthly income for employees who left and those who stayed. Make sure to map the bi_attrition variable to the x argument and the salary variable to the y argument.

  • Visualize the salary variable for employees who left and those who stayed using geom_histogram with geom_freqpoly. Make sure to facet the plot by the bi_attrition variable and apply alpha on the histogram plot.

  • Provide recommendations on whether revising compensation policies could be an effective retention strategy.

## compare the average monthly income of employees who left and those who stayed
attrition_ttest_results <- t.test(salary ~ bi_attrition, data = hr_perf_dta)

## print the results of the t-test
attrition_ttest_results

    Welch Two Sample t-test

data:  salary by bi_attrition
t = 18.869, df = 5524.2, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 38577.82 47523.18
sample estimates:
mean in group 0 mean in group 1 
      125007.26        81956.76 
## use the report function from the report package to generate a report of the t-test results
## run install.packages("report") once beforehand if the package is not yet installed
library(report)
attrition_report <- report(attrition_ttest_results)
print(attrition_report)
Effect sizes were labelled following Cohen's (1988) recommendations.

The Welch Two Sample t-test testing the difference of salary by bi_attrition
(mean in group 0 = 1.25e+05, mean in group 1 = 81956.76) suggests that the
effect is positive, statistically significant, and medium (difference =
43050.50, 95% CI [38577.82, 47523.18], t(5524.24) = 18.87, p < .001; Cohen's d
= 0.51, 95% CI [0.45, 0.56])
## use the ggbetweenstats function from the ggstatsplot package to visualize the distribution of monthly income for employees who left and those who stayed
## run install.packages("ggstatsplot") once beforehand if the package is not yet installed
library(ggstatsplot)
ggbetweenstats(
  data = hr_perf_dta,        
  x = bi_attrition,         
  y = salary,                
  xlab = "Attrition Status",
  ylab = "Monthly Income",  
  title = "Comparison of Monthly Income for Employees Who Stayed vs. Left",
  messages = FALSE         
)

# create histogram and frequency polygon of salary for employees who left and those who stayed
library(ggplot2)

ggplot(hr_perf_dta, aes(x = salary, fill = as.factor(bi_attrition))) +
  geom_histogram(alpha = 0.5, bins = 30, position = "identity") +  
  geom_freqpoly(aes(color = as.factor(bi_attrition)), bins = 30, linewidth = 1) +  
  facet_wrap(~ bi_attrition, ncol = 1, scales = "free_y") + 
  labs(
    title = "Salary Distribution of Employees Who Stayed vs. Left",
    x = "Monthly Income",
    y = "Count",
    fill = "Attrition Status",
    color = "Attrition Status"
  ) +
  theme_minimal() + 
  scale_fill_manual(values = c("0" = "#FF8080", "1" = "#40E0D0")) +  
  scale_color_manual(values = c("0" = "#FF8080", "1" = "#40E0D0"))   

Discussion:

The plots above compare the salaries of employees who stayed (Attrition Status = 0) with those who left (Attrition Status = 1):

  1. Salary Differences:
    • Employees who stayed had a higher average salary (about 125,007) than those who left (about 81,957), a difference of roughly 43,050.
  2. Statistical Significance:
    • The difference is statistically significant, with a very small p-value (5.17e-77): a gap this large would be very unlikely if the two groups had the same average salary.
  3. Effect Size:
    • The Hedges’ g value (0.46) indicates a moderate effect size, so salary is meaningfully associated with attrition, although the comparison alone does not show that low pay causes employees to leave.
  4. Confidence Intervals:
    • The 95% confidence interval for the effect size (0.42, 0.51) excludes zero and gives the range within which the true effect is likely to lie.
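
The effect sizes quoted above standardize the salary gap by the spread of salaries. As a rough sketch of the textbook formulas, the code below computes Cohen's d with a pooled standard deviation and applies the usual small-sample correction to obtain Hedges' g. The salaries are simulated, and packages such as report or ggstatsplot may use slightly different variants, so the numbers are not expected to match the output above.

```r
set.seed(147)

# Simulated salaries (illustrative only).
stayed <- rnorm(300, mean = 125000, sd = 90000)
left   <- rnorm(150, mean = 82000,  sd = 70000)

n1 <- length(stayed)
n2 <- length(left)

# Pooled standard deviation across the two groups.
pooled_sd <- sqrt(((n1 - 1) * var(stayed) + (n2 - 1) * var(left)) / (n1 + n2 - 2))

# Cohen's d: mean difference in units of the pooled standard deviation.
cohens_d <- (mean(stayed) - mean(left)) / pooled_sd

# Hedges' g: Cohen's d multiplied by a small-sample correction factor.
hedges_g <- cohens_d * (1 - 3 / (4 * (n1 + n2) - 9))

print(round(c(cohens_d = cohens_d, hedges_g = hedges_g), 3))
```

With samples in the thousands the correction factor is very close to 1, so Cohen's d and Hedges' g are nearly identical.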

5.5 Employee satisfaction and performance analysis

Task 5.5. Analyzing employee satisfaction and performance
  • Analyze the average performance ratings (both ManagerRating and SelfRating) of employees who left vs. those who stayed. Use the group_by and count functions to calculate the average performance ratings for each group.

  • Visualize the distribution of SelfRating for employees who left and those who stayed using a bar plot. Use the ggplot function to create the plot and map the SelfRating variable to the x argument and the bi_attrition variable to the fill argument.

  • Similarly, visualize the distribution of ManagerRating for employees who left and those who stayed using a bar plot. Make sure to map the ManagerRating variable to the x argument and the bi_attrition variable to the fill argument.

  • Create a boxplot of salary by job_satisfaction and bi_attrition to analyze the relationship between salary, job satisfaction, and attrition. Use the geom_boxplot function to create the plot and map the salary variable to the x argument, the job_satisfaction variable to the y argument, and the bi_attrition variable to the fill argument. You need to transform the job_satisfaction and bi_attrition variables into factors before creating the plot or within the ggplot function.

  • Discuss the results of the analysis and provide recommendations for HR interventions based on the findings.

# Analyze the average performance ratings (both ManagerRating and SelfRating) of employees who left vs. those who stayed.
library(dplyr)
library(ggplot2)
average_ratings <- hr_perf_dta |> 
  group_by(bi_attrition) |> 
  summarise(
    Avg_SelfRating = mean(self_rating, na.rm = TRUE),
    Avg_ManagerRating = mean(manager_rating, na.rm = TRUE),
    Total_Employees = n()
  )

print(average_ratings)
# A tibble: 2 × 4
  bi_attrition Avg_SelfRating Avg_ManagerRating Total_Employees
         <dbl>          <dbl>             <dbl>           <int>
1            0           3.98              3.48            4638
2            1           3.99              3.46            2261
# Visualize the distribution of SelfRating for employees who left and those who stayed using a bar plot.
ggplot(hr_perf_dta, aes(x = as.factor(self_rating), fill = as.factor(bi_attrition))) +
  geom_bar(position = "dodge") +
  labs(title = "Distribution of SelfRating by Attrition Status",
       x = "Self Rating",
       fill = "Attrition (0 = Stayed, 1 = Left)") +
  theme_minimal()

# Visualize the distribution of ManagerRating for employees who left and those who stayed using a bar plot.

ggplot(hr_perf_dta, aes(x = as.factor(manager_rating), fill = as.factor(bi_attrition))) +
  geom_bar(position = "dodge") +
  labs(title = "Distribution of ManagerRating by Attrition Status",
       x = "Manager Rating",
       fill = "Attrition (0 = Stayed, 1 = Left)") +
  theme_minimal()

# create a boxplot of salary by job_satisfaction and bi_attrition to analyze the relationship between salary, job satisfaction, and attrition.

hr_perf_dta$job_satisfaction <- as.factor(hr_perf_dta$job_satisfaction)
hr_perf_dta$bi_attrition <- as.factor(hr_perf_dta$bi_attrition)

ggplot(hr_perf_dta, aes(x = job_satisfaction, y = salary, fill = bi_attrition)) +
  geom_boxplot() +
  labs(title = "Boxplot of Salary by Job Satisfaction and Attrition",
       x = "Job Satisfaction",
       y = "Salary",
       fill = "Attrition (0 = Stayed, 1 = Left)") +
  theme_minimal()

Discussion:
  1. Average Performance Ratings:

    • The averages are nearly identical across the two groups: SelfRating is 3.98 for employees who stayed and 3.99 for those who left, while ManagerRating is 3.48 and 3.46, respectively. Average performance ratings therefore do not appear to separate employees who left from those who stayed. In both groups, self-ratings run about half a point above manager ratings, which suggests a possible disconnect between employee self-assessment and managerial evaluation.
  2. Distribution of SelfRating and ManagerRating:

    • The bar plots let us compare, for each rating level, how many employees stayed and how many left. If employees with low ratings made up a disproportionate share of those leaving, that would highlight a need for performance management interventions.
  3. Boxplot Insights:

    • The boxplot can reveal trends in salary distribution relative to job satisfaction levels and attrition status. For instance, if employees with high job satisfaction who leave the company still earn lower salaries than their counterparts who stayed, it may indicate that salary dissatisfaction is a driving factor in attrition.
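
One way to check whether a gap in average ratings is more than noise is a two-sample test. The sketch below runs a Welch t-test and, because ratings are ordinal, a rank-based Wilcoxon test as an alternative. It uses a made-up data frame named toy in place of hr_perf_dta, so the values it prints say nothing about the real data.

```r
# Toy data standing in for hr_perf_dta (values are made up for illustration)
set.seed(147)
toy <- data.frame(
  bi_attrition   = rep(c(0, 1), times = c(200, 100)),
  manager_rating = sample(2:5, 300, replace = TRUE)
)

# Welch two-sample t-test: does the mean manager rating differ by attrition status?
tt <- t.test(manager_rating ~ bi_attrition, data = toy)
print(tt$estimate)
print(tt$p.value)

# Rank-based alternative, since ratings are ordinal rather than continuous
wt <- wilcox.test(manager_rating ~ bi_attrition, data = toy)
print(wt$p.value)
```

The same calls should work on the real data by replacing toy with hr_perf_dta, assuming the column names shown in the code above.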

5.6 Recommendations for HR Interventions

Based on the findings, here are some recommendations for HR interventions:

  1. Performance Management Programs:

    • Implement or enhance performance management systems to provide clear feedback and development opportunities. This could help address discrepancies in self and manager ratings and improve employee satisfaction.
  2. Salary Benchmarking:

    • Conduct regular salary reviews to ensure that compensation is competitive and reflective of employee contributions. Addressing salary gaps can improve retention, especially among high-performing employees.
  3. Employee Engagement Surveys:

    • Regularly gather feedback from employees regarding job satisfaction and workplace environment. Use these insights to make targeted improvements that align with employee needs.
  4. Retention Strategies for High Performers:

    • Develop tailored retention strategies for high-performing employees identified through performance ratings. This could include mentorship, career development opportunities, and more competitive compensation packages.
  5. Focus on Job Satisfaction:

    • Establish initiatives that promote job satisfaction, such as team-building activities, recognition programs, and opportunities for professional growth.

By addressing these areas, the organization can create a more supportive and engaging work environment that reduces attrition rates and enhances overall employee satisfaction.

5.7 Work-life balance and retention strategies

Task 5.6. Analyzing work-life balance and retention strategies

At this point, you are already familiar with the dataset and the possible factors that contribute to employee attrition. Using your R skills, accomplish the following tasks:

  • Analyze the distribution of WorkLifeBalance ratings for employees who left versus those who stayed.

  • Use visualizations to show the differences.

  • Assess whether employees with poor work-life balance are more likely to leave.

You are free to choose how you accomplish this task. Be creative and provide insights that will help HR develop effective retention strategies.

library(dplyr)
library(ggplot2)

sum(is.na(hr_perf_dta$work_life_balance))
[1] 190
hr_perf_dta_cleaned <- hr_perf_dta |> 
  filter(!is.na(work_life_balance) & is.finite(work_life_balance))

worklife_summary <- hr_perf_dta_cleaned |> 
  group_by(bi_attrition) |> 
  summarise(
    Avg_WorkLifeBalance = mean(work_life_balance, na.rm = TRUE),
    Median_WorkLifeBalance = median(work_life_balance, na.rm = TRUE),
    Total_Employees = n()
  )

print(worklife_summary)
# A tibble: 2 × 4
  bi_attrition Avg_WorkLifeBalance Median_WorkLifeBalance Total_Employees
  <fct>                      <dbl>                  <dbl>           <int>
1 0                           3.41                      3            4448
2 1                           3.42                      3            2261
# Visualize WorkLifeBalance ratings using a boxplot
ggplot(hr_perf_dta_cleaned, aes(x = as.factor(bi_attrition), y = work_life_balance, fill = as.factor(bi_attrition))) +
  geom_boxplot() +
  labs(title = "Work-Life Balance Ratings by Attrition Status",
       x = "Attrition (0 = Stayed, 1 = Left)",
       y = "Work-Life Balance Rating",
       fill = "Attrition") +
  theme_minimal()

# Alternatively, visualize using a histogram
ggplot(hr_perf_dta_cleaned, aes(x = work_life_balance, fill = as.factor(bi_attrition))) +
  geom_histogram(position = "identity", alpha = 0.6, bins = 5) +
  labs(title = "Distribution of Work-Life Balance Ratings",
       x = "Work-Life Balance Rating",
       y = "Frequency",
       fill = "Attrition (0 = Stayed, 1 = Left)") +
  theme_minimal()
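
To assess directly whether employees with poor work-life balance are more likely to leave, group averages can be supplemented with the attrition rate at each rating level and a chi-squared test of independence. The following is a minimal sketch on a made-up data frame named toy, standing in for hr_perf_dta_cleaned; the 1 to 5 rating scale and the simulated values are assumptions for illustration.

```r
# Toy data standing in for hr_perf_dta_cleaned (made-up values)
set.seed(147)
toy <- data.frame(
  work_life_balance = sample(1:5, 400, replace = TRUE),
  bi_attrition      = rbinom(400, size = 1, prob = 0.33)
)

# Cross-tabulate rating level against attrition status
tab <- table(toy$work_life_balance, toy$bi_attrition)

# Row-wise proportions: share of each rating level that stayed (0) or left (1)
rates <- prop.table(tab, margin = 1)
print(round(rates, 3))

# Chi-squared test of independence between rating level and attrition
chi <- chisq.test(tab)
print(chi$p.value)
```

If poor work-life balance drove attrition, the share of employees who left would be visibly higher at the lowest rating levels and the test would return a small p-value.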

5.8 Recommendations for HR interventions

Task 5.7. Recommendations for HR interventions

Based on the analysis, salary is the factor most clearly associated with attrition: employees who left earned significantly less than those who stayed (Hedges’ g = 0.46). By contrast, the group averages for performance ratings (both manager and self-ratings) and for work-life balance are nearly identical for employees who left and those who stayed, so these summaries alone do not show that lower ratings or poorer work-life balance lead to higher attrition. Job satisfaction and work-life balance remain worth monitoring, but the evidence here points first to compensation.

To improve employee retention, HR could consider revising compensation policies to ensure that salaries are competitive across job roles and salary bands. A more equitable salary distribution, especially for those with higher responsibilities and tenure, could address compensation concerns. Enhancing work-life balance initiatives, such as offering flexible working hours, remote work options, and mental health support programs, can also improve employee satisfaction and retention.

HR should leverage insights from the analysis by prioritizing interventions in areas with the highest correlation to attrition. For example, departments or job roles with higher attrition rates should be targeted with tailored retention strategies, including mentorship programs or improved management practices. Focusing on enhancing job satisfaction through professional development opportunities, recognition programs, and fostering a positive work environment would also be beneficial.
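
Ranking groups by attrition rate is one simple way to decide where to target interventions first. The sketch below does this in base R on a small made-up data frame; the department column name and its values are hypothetical and may not match the names in hr_perf_dta.

```r
# Toy data with a hypothetical department column (made-up values)
toy <- data.frame(
  department   = c("Sales", "Sales", "Sales", "Sales",
                   "Technology", "Technology", "Technology",
                   "Human Resources", "Human Resources", "Human Resources"),
  bi_attrition = c(1, 1, 0, 0, 1, 0, 0, 0, 0, 0)
)

# The mean of a 0/1 indicator is the attrition rate within each department
rates <- aggregate(bi_attrition ~ department, data = toy, FUN = mean)
names(rates)[2] <- "attrition_rate"

# Sort from highest to lowest attrition rate
rates <- rates[order(-rates$attrition_rate), ]
print(rates)
```

The same pattern applies to job role or any other grouping column, since the mean of a 0/1 attrition indicator is the attrition rate for that group.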

Implementing these strategies would likely lead to increased employee retention, enhanced performance, and higher overall job satisfaction. This would help reduce turnover costs, improve workplace morale, and strengthen the company’s reputation as an employer, leading to long-term financial and operational benefits.